Abstract. The problem of improving the accuracy of nonparametric entropy
estimation for a stationary ergodic process is considered. New weak metrics
are introduced, and relations between metrics, measures, and entropy are
discussed. Based on weak metrics, a new nearest-neighbor entropy estimator is
constructed. It has a parameter that is tuned to reduce the estimator's bias.
It is shown that the estimator's variance is upper-bounded by a nearly optimal
Cramér-Rao lower bound.

1. Introduction. This study is concerned with improving the accuracy of
estimating the entropy (entropy rate) of information sources with a finite
state space whose statistical characterization is unknown. Sequences of
symbols, or strings, drawn from some finite alphabet appear in many
applications where objects can be encoded into strings in natural ways. Such
sequences are often viewed as realizations of stochastic processes, also known
as "information sources". An important quantity characterizing an information
source is its entropy (entropy rate). For a comprehensive review of previous
work on entropy estimation, see, for example, [6]. The most widely used are
the so-called "nonparametric" entropy estimators. However, an analytical
evaluation of the accuracy of those estimators is very difficult, and few
results are known. As a result, most published work on nonparametric entropy
estimation only proves asymptotic convergence to the entropy and tests it by
computer simulation.

For a given data sample of size n, the most important characteristic of an
estimator $h_n$ is its efficiency (accuracy), that is, its $L^2$-error
$E(h_n - h)^2$, where h is the entropy of the source and n is the number of
observations. We recall the relation

$$E(h_n - h)^2 = \operatorname{Var} h_n + (E h_n - h)^2,$$

where the quantity $E h_n - h$ is called the bias.
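
The decomposition can be checked on a toy example. The sketch below uses a hypothetical three-valued estimator; the values, their probabilities, and the "true" entropy h = 1 are illustrative assumptions, not data from this paper. Exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction

# Hypothetical toy estimator: it takes each value with the listed probability.
values = [Fraction(1, 2), Fraction(3, 4), Fraction(5, 4)]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
h = Fraction(1)  # assumed "true" entropy for this illustration

mean = sum(p * v for p, v in zip(probs, values))                  # E h_n
variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))  # Var h_n
bias = mean - h                                                   # E h_n - h
mse = sum(p * (v - h) ** 2 for p, v in zip(probs, values))        # E (h_n - h)^2

print(mse, variance + bias ** 2)
```

Both printed numbers coincide, which is the statement that the $L^2$-error splits into a variance part and a squared-bias part.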

Most known nonparametric entropy estimators are based either on Lempel-Ziv
compression or on the nearest-neighbor method; see, for instance, the review
in [6]. Those estimators are shown to converge almost everywhere (for the most
general results, see [6], [8]). Unfortunately, because their convergence is
slow (at a rate no better than $O(1/\log n)$), their accuracy is not good
enough for many practical applications, where the sample size is relatively
small ($\log_2 n < 30$ to $40$). This motivates a search for estimators with
more rapid convergence.

Lempel-Ziv estimators are very hard to analyze, and their accuracy is hard to
evaluate analytically. For example, the initial motivation for paper [1] was
the desire to obtain asymptotic properties of an entropy estimation algorithm
due to Ziv [13], but the authors show that calculating the bias and the
variance is very difficult. To date, there is no published work on the bias or
variance of Lempel-Ziv estimators.

From now on, we focus on nearest-neighbor estimators and briefly state the
most important published results. We point out that it is often more
convenient to estimate, instead of the entropy h, its inverse 1/h.

Two modifications of Grassberger's estimator [5] are proposed in [11]. In the
notation of this paper, they are written as $r_n^{(k,m)}(\rho)/\log n$ (see (7))
and $\eta_n^{(k,m)}(\rho)$ (see (16)), where $\rho$ is a metric.

For the estimator $r_n^{(k,m)}(\rho)/\log n$, $L^1$-convergence and the variance
bound $O(n^{-c})$ are shown in [11] under a certain restriction on source
measures. For metric (3) specifically, this restriction is relaxed (see (8)) and
convergence almost everywhere is established in [6]. It is also shown in [6]
that the variance bound $O(n^{-c})$ holds for any c < 1.

For the estimator $\eta_n^{(k,m)}(\rho)$, convergence in mean is established in
[11] under certain restrictions on metrics and source measures.

For metric (3), computer simulation [6] showed that estimator (16) is more
efficient than the estimator $r_n^{(k,m)}(\rho)/\log n$. However, a subsequent
work [7] established that, for symmetric Bernoulli measures, the estimator's
bias is a periodic function with a period proportional to log n. In computer
simulation, such a bias was difficult to detect because its amplitude was less
than $10^{-6}$ for sources with a small entropy (h < 3).

In [12], the bias was also calculated explicitly for Markov measures and
metric (3). This bias was equal to zero if the logarithms of the transition
probabilities were rationally incommensurable. Otherwise, the bias was a
periodic function with a period proportional to log n. This result
demonstrates a new obstacle to the analytical evaluation of an estimator: the
bias can be a discontinuous function of the measure parameters.

The objective of this research is to construct a new estimator, based on an
existing nearest-neighbor estimator and its modifications, that achieves
efficiency $O(n^{-c})$ for some measures, where c > 0 is a constant. The main
idea of the construction is as follows. A nearest-neighbor estimator is based
on some metric. We introduce a wider class of so-called "weak" metrics [2], for
which the triangle inequality holds up to some constant C > 1. The new
estimator then has a parameter, which is a non-decreasing function. We expect
that this function can be selected so as to reduce the bias. Specifically, we
introduce a one-parameter class of functions and optimize the parameter to
reduce the bias. It is shown that, for symmetric Bernoulli measures, there
exists a parameter value for which the bias is asymptotically zero.

Our paper is organized as follows:

• In Section 3, we introduce new weak metrics and discuss the connection
between metrics, measures, and entropy.

• In Section 4, we discuss a nearest-neighbor statistic and show that the
statistic's variance is upper-bounded by $O(n^{-1})$ for a large class of
measures and weak metrics.

• In Section 5, we introduce our new nearest-neighbor estimator (based on the
statistic of Section 4) and its modifications, and prove that this estimator
is unbiased for symmetric Bernoulli measures.

2. Notation and Definitions. For our purposes, an information source, or
stationary process, is a shift-invariant ergodic measure $\mu$ on the space
$\Omega = A^{\mathbb{N}}$ of right-sided infinite sequences drawn from a finite
alphabet A, where $\mathbb{N} = \{1, 2, \ldots\}$. Thus, an infinite random
sequence generated by $\mu$ is viewed as a point in $\Omega$ chosen randomly
with respect to $\mu$ and is denoted by $\xi = (\xi_1, \xi_2, \ldots)$.

For a cylinder centered at $x \in \Omega$, $s = 1, 2, \ldots$, we use the
following notation:

$$C_s(x) = \{y \in \Omega : y_1 = x_1, \ldots, y_s = x_s\}.$$

Let $\rho$ be a metric on $\Omega$. We denote the open ball of radius r
centered at x by $B(x, r, \rho) = \{y \in \Omega : \rho(x, y) < r\}$. To
simplify the notation, we write $B(x, r)$ for $B(x, r, \rho)$.

Let $\xi = (\xi_1, \xi_2, \ldots)$ be a point in $\Omega$ chosen randomly with
respect to $\mu$. Recall that the entropy h (entropy rate) of a measure $\mu$
is defined as

$$h = -\lim_{n \to \infty} \frac{1}{n} E \log \mu(C_n(\xi)); \qquad (1)$$

here and throughout the paper, all logarithms are natural (base e).
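
As a small illustration of definition (1), the following sketch evaluates $-(1/n) E \log \mu(C_n(\xi))$ by direct enumeration for an i.i.d. (Bernoulli) source. The symbol probabilities `p` and the function name `block_entropy_rate` are hypothetical choices made for this example. For a product measure the quantity does not depend on n and equals the Shannon entropy of `p` in nats.

```python
import itertools
import math

def block_entropy_rate(p, n):
    """Return -(1/n) * E log mu(C_n(xi)) for an i.i.d. source with symbol law p."""
    total = 0.0
    for word in itertools.product(range(len(p)), repeat=n):
        prob = math.prod(p[a] for a in word)   # mu(C_n(x)) for a product measure
        total += prob * math.log(prob)
    return -total / n

p = [0.5, 0.3, 0.2]                            # hypothetical symbol probabilities
shannon = -sum(q * math.log(q) for q in p)     # entropy in nats (natural logarithm)
rates = [block_entropy_rate(p, n) for n in (1, 2, 5)]
print(shannon, rates)
```

For sources with memory the enumeration would have to use the joint law of the first n symbols, and the limit in (1) would no longer be attained at finite n.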

Problem Statement:

Let $\mu$ be a shift-invariant ergodic probability measure on
$\Omega = A^{\mathbb{N}}$. Let $\xi_0, \xi_1, \ldots, \xi_n$ be independent
random variables taking values in $\Omega$ and identically distributed with
common law $\mu$. We want to estimate the entropy of the measure $\mu$.

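3. Metrics on Sequence Spaces. Let $x = (x_1, x_2, \ldots)$ and
$y = (y_1, y_2, \ldots)$ be points in $\Omega$, and let $a, b \in A$. Here $ax$
denotes the sequence obtained by prepending the symbol a to x. We define the
following metric on $\Omega$:

$$\rho(ax, by) = \begin{cases} e^{-1} \rho(x, y), & a = b, \\ e^{-\lambda(-\log \rho(x, y))}, & a \ne b, \end{cases} \qquad (2)$$

where $\lambda(t)$ is a nondecreasing function such that $\lambda(0) = 0$ and
$\lambda(t) \le 1$, $0 \le t < \infty$.

In particular, for $\lambda(t) = 0$, $0 \le t < \infty$, we obtain the following
well-known metric:

$$\rho_0(x, y) = e^{-\inf\{k :\, x_{k+1} \ne y_{k+1}\}}. \qquad (3)$$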

We stress that metric (2) is bi-Lipschitz equivalent to metric (3), i.e., we
have

$$\rho_0(x, y) \le \rho(x, y) \le e \rho_0(x, y). \qquad (4)$$

Therefore, according to [2], $\rho$ is a weak metric (or near-metric), i.e., the
triangle inequality holds up to some constant C > 1.

While each point x has infinitely many coordinates, any practical computation
of an estimate can use only finitely many of them. We handle this by
introducing a truncation of a metric that uses only the first m coordinates of
the points.
We define $\rho^{(m)}$, the truncation of a metric $\rho$, as follows:

$$\rho^{(m)}(ax, by) = \begin{cases} e^{-1} \rho^{(m-1)}(x, y), & a = b, \\ e^{-\lambda(-\log \rho^{(m-1)}(x, y))}, & a \ne b. \end{cases} \qquad (5)$$

To simplify the notation, we define

$$a(x, y) = -\log \rho(x, y), \qquad a^{(m)}(x, y) = -\log \rho^{(m)}(x, y). \qquad (6)$$

Note that $-1/\log \rho_0(x, y)$ is also a metric on $\Omega$, but $1/a(x, y)$
is not a metric.
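
The recursion for the truncated metric can be unrolled from the last used coordinate back to the first. The sketch below is one possible reading of it, not the authors' implementation: the function name `rho_m`, the base case $\rho^{(0)} = 1$ (the text does not state one), and the sample function `lam_cap` are assumptions made for illustration.

```python
import math

def rho_m(x, y, m, lam):
    """Truncated weak metric rho^(m)(x, y), unrolled from the last coordinate.

    x, y : sequences of symbols with at least m entries
    lam  : nondecreasing function with lam(0) = 0 and lam(t) <= 1
    Assumption: rho^(0) = 1 when no coordinates are compared.
    """
    value = 1.0                                   # assumed base case rho^(0)
    for i in range(m - 1, -1, -1):
        if x[i] == y[i]:
            value = math.exp(-1.0) * value        # branch a = b
        else:
            value = math.exp(-lam(-math.log(value)))  # branch a != b
    return value

def lam_zero(t):
    """lambda = 0: the recursion then counts the common prefix length."""
    return 0.0

def lam_cap(t):
    """Hypothetical admissible lambda: nondecreasing, lam(0) = 0, lam <= 1."""
    return min(t, 1.0)

d0 = rho_m("abba", "abab", 4, lam_zero)   # common prefix "ab" has length 2
d1 = rho_m("abab", "bbab", 4, lam_cap)    # first symbols differ, tails agree
print(d0, d1)
```

With `lam_zero` the value is $e^{-L}$, where L is the length of the common prefix (capped at m). With a nonzero `lam`, two points that differ in the first symbol are still pulled closer when their tails agree, and the two values stay within a factor of e of each other.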

Proposition 1. Let $\mu$ be a shift-invariant ergodic measure on $\Omega$ and
let $\rho$ be metric (2); then, for $\mu$-almost all points $x \in \Omega$,

$$\lim_{r \to 0} \frac{\log \mu(B(x, r, \rho))}{\log r} = h,$$

where h is the entropy of $\mu$.

Proof. First, we consider the special case $\rho = \rho_0$. Balls in the metric
$\rho_0$ are cylinders, i.e.,

$$B(x, r, \rho_0) = C_n(x), \qquad e^{-n-1} < r \le e^{-n}.$$

Therefore, we have

$$\lim_{r \to 0} \frac{\log \mu(B(x, r, \rho_0))}{\log r} = -\lim_{n \to \infty} \frac{\log \mu(C_n(x))}{n}.$$

Applying the Shannon-McMillan-Breiman theorem [9, 2.10], we obtain, for
$\mu$-almost all points $x \in \Omega$,

$$\lim_{r \to 0} \frac{\log \mu(B(x, r, \rho_0))}{\log r} = -\lim_{n \to \infty} \frac{\log \mu(C_n(x))}{n} = h.$$

Now we consider the general metric (2). Since $\rho$ is bi-Lipschitz equivalent
(see (4)) to metric (3), we have

$$B(x, e^{-1} r, \rho_0) \subset B(x, r, \rho) \subset B(x, e r, \rho_0).$$

Hence, we have

$$\mu(B(x, e^{-1} r, \rho_0)) \le \mu(B(x, r, \rho)) \le \mu(B(x, e r, \rho_0)).$$

Therefore, for $\mu$-almost all points $x \in \Omega$, we obtain

$$\lim_{r \to 0} \frac{\log \mu(B(x, r, \rho))}{\log r} = h.$$

□ 
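
The pointwise limit in Proposition 1 can be illustrated numerically in the simplest setting. The sketch below assumes an i.i.d. source with hypothetical symbol probabilities `p` and uses the fact that balls in $\rho_0$ are cylinders, so at radius $r = e^{-n}$ the ratio $\log \mu(B)/\log r$ reduces to $-(1/n) \log \mu(C_n(x))$ up to a bounded shift of n. The seed and the sample length are arbitrary.

```python
import math
import random

def pointwise_ratio(p, n, seed=0):
    """Return -(1/n) * log mu(C_n(x)) for one sampled point x of an i.i.d. source."""
    rng = random.Random(seed)
    symbols = rng.choices(range(len(p)), weights=p, k=n)
    log_cylinder = sum(math.log(p[s]) for s in symbols)   # log mu(C_n(x))
    return -log_cylinder / n                               # log mu / log r with r = e^{-n}

p = [0.5, 0.3, 0.2]                        # hypothetical symbol probabilities
h = -sum(q * math.log(q) for q in p)       # entropy of the source in nats
ratio = pointwise_ratio(p, n=20000)
print(h, ratio)
```

For a single long sample the ratio is close to h, in line with the Shannon-McMillan-Breiman theorem used in the proof.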

4. Nearest-neighbor statistics. In this section, we consider a nonparametric
statistic $r_n^{(k,m)}(\rho)$. This statistic is based on a sample of n + 1
independent points $\xi_0, \ldots, \xi_n$ in the space $\Omega$ chosen randomly
with respect to $\mu$, and on the metric $\rho$ on $\Omega$. It is defined as
follows:

$$r_n^{(k,m)}(\rho) = -\frac{1}{n+1} \sum_{j=0}^{n} \log \Bigl( \min^{(k)}_{i :\, i \ne j} \rho^{(m)}(\xi_i, \xi_j) \Bigr), \qquad (7)$$

where $\rho$ is metric (2) and $\min^{(k)}\{X_1, \ldots, X_N\}$ is defined by
$\min^{(k)}\{X_1, \ldots, X_N\} = X_k$ if $X_1 \le X_2 \le \cdots \le X_N$.

We stress that this statistic uses only the first m coordinates of the points
$\xi_0, \ldots, \xi_n$.
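
A direct, quadratic-time sketch of the statistic is given below. It is meant only to make the definition concrete: the helper `rho_m` restates the truncated metric with an assumed base case $\rho^{(0)} = 1$, and the names `nn_statistic`, `points`, and `lam_zero` are hypothetical.

```python
import math

def rho_m(x, y, m, lam):
    """Truncated weak metric; base case rho^(0) = 1 is an assumption."""
    value = 1.0
    for i in range(m - 1, -1, -1):
        if x[i] == y[i]:
            value = math.exp(-1.0) * value
        else:
            value = math.exp(-lam(-math.log(value)))
    return value

def nn_statistic(points, k, m, lam):
    """Average over j of -log of the k-th smallest distance from point j to the others."""
    total = 0.0
    for j, xj in enumerate(points):
        dists = sorted(rho_m(xi, xj, m, lam)
                       for i, xi in enumerate(points) if i != j)
        total += -math.log(dists[k - 1])          # k-th smallest distance
    return total / len(points)

def lam_zero(t):
    return 0.0

points = ["aaaa", "aaab", "abab", "bbbb"]          # toy sample, n + 1 = 4 points
r1 = nn_statistic(points, 1, 4, lam_zero)
r2 = nn_statistic(points, 2, 4, lam_zero)
print(r1, r2)
```

For this toy sample with $\lambda = 0$, the nearest-neighbor terms are the common prefix lengths 3, 3, 1, 0, so `r1` is 1.75, and the second-nearest terms 1, 1, 1, 0 give `r2` = 0.75.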

Theorem 1 and Proposition 8 of [11] imply the following statement: 

Proposition 2. Let $\xi_0, \ldots, \xi_n$ be n + 1 independent points in the
space $\Omega$ chosen randomly with respect to $\mu$, and let
$k = O(\log n)$; then the following limit holds:

$$\lim_{n \to \infty} \frac{E r_n^{(k,\infty)}(\rho)}{\log n} = \frac{1}{h}.$$

Lemma 4.1. Let a measure $\mu$ satisfy the following condition:

$$\exists\, a, b > 0 : \ \mu(C_n(x)) \le b e^{-an}, \quad \forall n > 0, \ \text{a.e. } x \in \Omega; \qquad (8)$$

then there exist constants $c_1$, $c_2$ such that the following inequality holds:

$$E r_n^{(k,\infty)}(\rho) - E r_n^{(k,m)}(\rho) \le c_1 n^{-1}, \quad \text{for } m > c_2 \log n. \qquad (9)$$
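Proof. Arguing as in the proof of Theorem 1 of [11], we obtain

$$E r_n^{(k,m)}(\rho) = -n \binom{n-1}{k-1} \int_\Omega \int_0^1 \log r \, \mu(B(x, r, \rho^{(m)}))^{k-1} \bigl(1 - \mu(B(x, r, \rho^{(m)}))\bigr)^{n-k} \, d_r \mu(B(x, r, \rho^{(m)})) \, d\mu(x). \qquad (10)$$

From the identity

$$B(x, r, \rho) = B(x, r, \rho^{(m)}), \qquad r > e^{-m},$$

we get

$$E r_n^{(k,\infty)}(\rho) - E r_n^{(k,m)}(\rho) \le -n \binom{n-1}{k-1} \int_\Omega \int_0^{e^{-m}} \log r \, \mu(B(x, r, \rho))^{k-1} \bigl(1 - \mu(B(x, r, \rho))\bigr)^{n-k} \, d_r \mu(B(x, r, \rho)) \, d\mu(x).$$

Using Condition (8) and (4), we obtain

$$\mu(B(x, r, \rho)) \le c_3 r^a.$$

Therefore, we have

$$E r_n^{(k,\infty)}(\rho) - E r_n^{(k,m)}(\rho) \le -n^k \int_0^{e^{-m}} \log r \, (c_3 r^a)^{k-1} \, dr.$$

Calculating the above integral, we obtain

$$E r_n^{(k,\infty)}(\rho) - E r_n^{(k,m)}(\rho) = O\bigl(n^k m e^{-m(a(k-1)+1)}\bigr).$$

If we set $c_2 > 1/a$, then inequality (9) follows from the above equality. □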
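
The order of the tail integral used in the proof can be checked numerically. Leaving out the constant factors, the integral is $\int_0^{e^{-m}} (-\log r)\, r^{p}\, dr$ with $p = a(k-1)$, and its antiderivative gives $e^{-m(p+1)} \bigl( m/(p+1) + 1/(p+1)^2 \bigr)$, which is $O(m e^{-m(a(k-1)+1)})$. The sketch below compares this closed form with a midpoint rule; the parameter values are arbitrary and the function names are hypothetical.

```python
import math

def tail_integral_closed(a, k, m):
    """Closed form of the integral of -log(r) * r**p over (0, e^{-m}), p = a*(k-1)."""
    p = a * (k - 1)
    return math.exp(-m * (p + 1)) * (m / (p + 1) + 1 / (p + 1) ** 2)

def tail_integral_numeric(a, k, m, steps=20000):
    """Midpoint-rule approximation of the same integral."""
    p = a * (k - 1)
    h = math.exp(-m) / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        total += -math.log(r) * r ** p * h
    return total

closed = tail_integral_closed(a=0.5, k=3, m=4)
numeric = tail_integral_numeric(a=0.5, k=3, m=4)
print(closed, numeric)
```

The leading term $m/(p+1)$ supplies the factor m in the bound, and the exponential factor supplies $e^{-m(a(k-1)+1)}$.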

Applying Proposition 2, we get the following statement:

Corollary 1. For some constant c > 1/a, let the following relations hold:

$$c \log n < m, \qquad k = O(\log n);$$

then the following limit holds:

$$\lim_{n \to \infty} \frac{E r_n^{(k,m)}(\rho)}{\log n} = \frac{1}{h}.$$
Theorem 4.2. Let $r_n^{(k,m)}(\rho)$ be the statistic defined in (7); then the
following inequality holds:

$$\operatorname{Var} r_n^{(k,m)}(\rho) \le \frac{m^2 (km + 1)^2}{4(n + 1)}. \qquad (11)$$

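Proof. In this proof, we use McDiarmid's method [10].
We introduce a function

$$f : \Omega^{n+1} \to \mathbb{R}$$

defined as

$$f(x_0, \ldots, x_n) = \frac{1}{n+1} \sum_{j=0}^{n} \max^{(k)}_{i :\, i \ne j} a^{(m)}(x_i, x_j),$$

where $\max^{(k)}$ denotes the k-th largest value.

In order to apply McDiarmid's method, we need to show that f satisfies the
inequality

$$\sup_{x_0, \ldots, x_n,\, y \in \Omega} |f(x_0, \ldots, x_n) - f(x_0, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n)| \le c_i, \qquad (12)$$

for all $0 \le i \le n$.

We prove this inequality for

$$c_i = \frac{m(km + 1)}{n + 1}, \qquad 0 \le i \le n. \qquad (13)$$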

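For brevity, we introduce the following notation:

$$X = (x_0, \ldots, x_n),$$
$$\tilde{X} = (x_0, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n),$$
$$g_j(X) = \max^{(k)}_{s :\, s \ne j} a^{(m)}(x_s, x_j).$$

Let $J = \{j \ne i : g_j(X) \ne g_j(\tilde{X})\}$.
Since

$$g_j(\tilde{X}) = g_j(X), \quad j \notin J, \ j \ne i;$$
$$|g_j(\tilde{X}) - g_j(X)| \le m, \quad \forall j;$$

we have

$$|f(X) - f(\tilde{X})| \le \frac{m(|J| + 1)}{n + 1}.$$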

Let us prove that $|J| \le km$.

If $j \in J$, then $g_j(X) = a^{(m)}(x_i, x_j)$.

Suppose that $l \in \{1, 2, \ldots, m\}$ is such that

$$x_j \notin C_l(x_i), \qquad x_j \in C_{l-1}(x_i);$$

then, for all $y \in C_l(x_i)$,

$$a^{(m)}(y, x_j) \ge a^{(m)}(y, x_i).$$

Hence, $|J| \le km$.

This proves (13).

As shown in [3], McDiarmid's martingale method provides the following bound
on the variance of $f(\xi_0, \ldots, \xi_n)$:

$$\operatorname{Var}[f(\xi_0, \ldots, \xi_n)] \le \frac{1}{4} \sum_{i=0}^{n} c_i^2. \qquad (14)$$

Substituting (13) into (14), we obtain (11).

□ 
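
The last substitution is plain arithmetic: all n + 1 constants $c_i$ are equal to $m(km+1)/(n+1)$, so $\frac14 \sum_i c_i^2 = m^2(km+1)^2 / (4(n+1))$. The sketch below evaluates both sides for arbitrary illustrative values of n, k, and m; the function names are hypothetical.

```python
import math

def bounded_difference(n, k, m):
    """The constant m(km + 1)/(n + 1), the same for every coordinate i."""
    return m * (k * m + 1) / (n + 1)

def variance_bound(n, k, m):
    """One quarter of the sum of the n + 1 squared bounded-difference constants."""
    c = bounded_difference(n, k, m)
    return 0.25 * (n + 1) * c ** 2

n, k, m = 999, 2, 10                      # arbitrary illustrative values
lhs = variance_bound(n, k, m)
rhs = m ** 2 * (k * m + 1) ** 2 / (4 * (n + 1))
print(lhs, rhs)
```

With m and k of order log n, the bound is of order $n^{-1} \log^6 n$.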

Corollary 2.

$$P\{|r_n^{(k,m)}(\rho) - E r_n^{(k,m)}(\rho)| > \delta\} \le 2 e^{-2(n+1)\delta^2 / (m^2 (km+1)^2)}. \qquad (15)$$

Proof. This inequality is obtained by applying McDiarmid's inequality [10]

$$P\{|f(\xi_0, \ldots, \xi_n) - E f(\xi_0, \ldots, \xi_n)| > \delta\} \le 2 e^{-2\delta^2 / \sum_{i=0}^{n} c_i^2},$$

where $c_i$ is defined in (12).

Substituting (13) into the above inequality, we get

$$P\{|f(\xi_0, \ldots, \xi_n) - E f(\xi_0, \ldots, \xi_n)| > \delta\} \le 2 e^{-2(n+1)\delta^2 / (m^2 (km+1)^2)}.$$

Substituting $r_n^{(k,m)}(\rho) = f(\xi_0, \ldots, \xi_n)$, we get (15). □
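
To get a feeling for the concentration bound, the sketch below evaluates its right-hand side and inverts it for the sample size. Note the simplifying assumption: k and m are held fixed here, whereas in the paper they grow like log n. The function names and the numerical values are hypothetical.

```python
import math

def tail_bound(n, k, m, delta):
    """2 * exp(-2 (n+1) delta^2 / (m^2 (km+1)^2)), the McDiarmid-type tail bound."""
    return 2 * math.exp(-2 * (n + 1) * delta ** 2 / (m ** 2 * (k * m + 1) ** 2))

def samples_needed(k, m, delta, alpha):
    """Smallest n for which the bound is at most alpha, with k and m held fixed."""
    need = m ** 2 * (k * m + 1) ** 2 * math.log(2 / alpha) / (2 * delta ** 2)
    return max(0, math.ceil(need) - 1)

n_min = samples_needed(k=2, m=10, delta=5.0, alpha=0.05)
print(n_min, tail_bound(n_min, 2, 10, 5.0))
```

The deviation $\delta$ refers to the unnormalized statistic; after division by log n the same bound applies with $\delta \log n$ in place of $\delta$.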
Corollary 3. Let a measure $\mu$ satisfy (8), and let the inequality
$c \log n < m$ hold for some constant c > 1/a, with $m = O(\log n)$ and
$k = O(\log n)$; then the sequence

$$\frac{r_n^{(k,m)}(\rho)}{\log n}$$

converges to 1/h a.e.

5. Entropy Estimator. In this section, we consider a nonparametric estimator
$\eta_n^{(k,m)}(\rho)$ for the inverse entropy 1/h, where the metric $\rho$ is
defined in (2).

This estimator is based on a sample of n + 1 independent points
$\xi_0, \ldots, \xi_n$ in the space $\Omega$ chosen randomly with respect to
$\mu$, and on metric (2) on $\Omega$. It is defined as follows:

$$\eta_n^{(k,m)}(\rho) = k \bigl( r_n^{(k,m)}(\rho) - r_n^{(k+1,m)}(\rho) \bigr), \qquad (16)$$

where $r_n^{(k,m)}(\rho)$ is defined in (7).
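
A minimal end-to-end sketch of the estimator follows. It is a naive quadratic-time illustration under stated assumptions, not an efficient or authoritative implementation: the base case $\rho^{(0)} = 1$, the function names, the choice $\lambda(t) = \log_\beta(\beta + (1-\beta)\beta^t)$ with $\beta = 1/2$, and the simulated binary source are all choices made for this example.

```python
import math
import random

def rho_m(x, y, m, lam):
    """Truncated weak metric; base case rho^(0) = 1 is an assumption."""
    value = 1.0
    for i in range(m - 1, -1, -1):
        if x[i] == y[i]:
            value = math.exp(-1.0) * value
        else:
            value = math.exp(-lam(-math.log(value)))
    return value

def nn_statistic(points, k, m, lam):
    """Average of -log of the k-th smallest distance from each point to the others."""
    total = 0.0
    for j, xj in enumerate(points):
        dists = sorted(rho_m(xi, xj, m, lam)
                       for i, xi in enumerate(points) if i != j)
        total += -math.log(dists[k - 1])
    return total / len(points)

def eta(points, k, m, lam):
    """k times the difference of the k-th and (k+1)-th nearest-neighbor statistics."""
    return k * (nn_statistic(points, k, m, lam) - nn_statistic(points, k + 1, m, lam))

def lam_beta(beta):
    """One-parameter family lambda(t) = log_beta(beta + (1 - beta) * beta**t)."""
    log_beta = math.log(beta)
    return lambda t: math.log(beta + (1 - beta) * beta ** t) / log_beta

# Simulated symmetric binary source: 101 points, 16 coordinates each.
rng = random.Random(1)
sample = [[rng.randrange(2) for _ in range(16)] for _ in range(101)]
estimate = eta(sample, k=3, m=16, lam=lam_beta(0.5))
print(estimate, 1 / math.log(2))          # estimate of 1/h next to the true 1/h
```

The estimate of the entropy itself would be `1 / estimate`. A practical implementation would avoid the all-pairs distance computation, for example by organizing the sample in a prefix tree.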



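Applying Theorem 4.2 and the inequality
$\operatorname{Var}(X + Y) \le \bigl(\sqrt{\operatorname{Var} X} + \sqrt{\operatorname{Var} Y}\bigr)^2$,
we obtain the following statement:

Proposition 3. Let $m = O(\log n)$ and $k = O(\log n)$; then

$$\operatorname{Var} \eta_n^{(k,m)}(\rho) = O(n^{-1} \log^8 n).$$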

Thus, we have bounded the estimator's variance. Calculating the estimator's
bias is much more complicated, and we do it only for a symmetric Bernoulli
measure.

Proposition 4. Let $\mu$ be a symmetric Bernoulli measure and let the function
$\lambda(t)$ of (2) satisfy the identity

$$\lambda(t) = \log_\beta \bigl( \beta + (1 - \beta) \beta^t \bigr) \qquad (17)$$

for $0 < \beta < 1$.

Then, for $\beta = 1/|A|$, we have

$$E \eta_n^{(k,\infty)}(\rho) = 1/h = 1/\log |A|.$$

Proof. We introduce the notation

$$F(t) = \mu(B(x, e^{-t})).$$

Clearly, for a symmetric Bernoulli measure (with equiprobable symbols), F does
not depend on x and satisfies the equation

$$F(t) = \beta F(t - 1) + (1 - \beta) F(\lambda^{-1}(t)),$$

where $\beta = 1/|A|$. It can easily be checked that $F(t) = \beta^t$ is a
solution of this equation.
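
The claim that $F(t) = \beta^t$ solves the functional equation can be checked numerically. The sketch below relies on two conventions that are assumptions of this example, since the text does not spell them out: $F(t) = 1$ for $t \le 0$ (a ball of radius at least 1 is the whole space), and $\lambda^{-1}(t) = \infty$ with $F(\infty) = 0$ for $t \ge 1$ (because $\lambda < 1$). The function names are hypothetical.

```python
import math

def lam(t, beta):
    """lambda(t) = log_beta(beta + (1 - beta) * beta**t)."""
    return math.log(beta + (1 - beta) * beta ** t) / math.log(beta)

def lam_inverse(t, beta):
    """Inverse of lam on [0, 1); infinity for t >= 1 (assumed convention)."""
    if t >= 1:
        return math.inf
    return math.log((beta ** t - beta) / (1 - beta)) / math.log(beta)

def F(t, beta):
    """Candidate solution beta**t, extended by 1 for t <= 0 and 0 at infinity."""
    if t <= 0:
        return 1.0
    if math.isinf(t):
        return 0.0
    return beta ** t

def residual(t, beta):
    """Left-hand side minus right-hand side of the functional equation."""
    rhs = beta * F(t - 1, beta) + (1 - beta) * F(lam_inverse(t, beta), beta)
    return F(t, beta) - rhs

grid = [0.0, 0.25, 0.5, 0.99, 1.0, 1.5, 3.0]
worst = max(abs(residual(t, b)) for t in grid for b in (0.5, 0.25))
print(worst)
```

For $0 < t < 1$ the first term contributes exactly $\beta$ and the second contributes $\beta^t - \beta$; for $t \ge 1$ only the first term survives and equals $\beta \cdot \beta^{t-1}$.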

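Substituting $F(t) = \beta^t$ for $m = \infty$ in (10), we get

$$E r_n^{(k,\infty)}(\rho) = -n \binom{n-1}{k-1} \int_0^1 \log r \, \beta^{-(k-1)\log r} \bigl(1 - \beta^{-\log r}\bigr)^{n-k} \, d\beta^{-\log r}.$$

If we replace $\beta^{-\log r}$ by x, we obtain

$$E r_n^{(k,\infty)}(\rho) = \frac{n}{\log \beta} \binom{n-1}{k-1} \int_0^1 \log x \, x^{k-1} (1 - x)^{n-k} \, dx.$$

Calculating the integral (see [4, 4.253.1]), we get

$$E r_n^{(k,\infty)}(\rho) = \frac{H_{k-1} - H_n}{\log \beta},$$

where $H_n$ are the harmonic numbers

$$H_n = \sum_{s=1}^{n} \frac{1}{s}.$$

Using (16), we obtain

$$E \eta_n^{(k,\infty)}(\rho) = -\frac{1}{\log \beta},$$

which equals $1/\log |A|$ for $\beta = 1/|A|$.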



□ 
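
The final cancellation can be verified with exact arithmetic. The sketch below assumes the closed form $E r_n^{(k,\infty)}(\rho) = (H_{k-1} - H_n)/\log \beta$ that results from the Beta-type integral in the proof, and shows that $k\,(H_{k-1} - H_k) = -1$, so the estimator's expectation is $-1/\log \beta$ for every n and k. The function names and the values n = 100, k = 3 are arbitrary.

```python
import math
from fractions import Fraction

def harmonic(n):
    """n-th harmonic number as an exact fraction (H_0 = 0)."""
    return sum((Fraction(1, s) for s in range(1, n + 1)), Fraction(0))

def expected_r(n, k, beta):
    """Assumed closed form (H_{k-1} - H_n) / log(beta) for the statistic's mean."""
    return float(harmonic(k - 1) - harmonic(n)) / math.log(beta)

def expected_eta(n, k, beta):
    """k times the difference of the means for k and k + 1."""
    return k * (expected_r(n, k, beta) - expected_r(n, k + 1, beta))

k = 3
exact = k * (harmonic(k - 1) - harmonic(k))     # equals -1 exactly
value = expected_eta(100, k, 0.5)               # binary symmetric source
print(exact, value, 1 / math.log(2))
```

Because $H_n$ cancels in the difference, the result does not depend on the sample size, which is why the bias vanishes in this special case.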



6. Conclusion. In this work, we have introduced a new nearest-neighbor entropy
estimator based on a new large family of weak metrics. The estimator has a
parameter that is tuned to reduce its bias. We have bounded the estimator's
variance and shown that it is upper-bounded by a nearly optimal Cramér-Rao
lower bound.

We have explicitly calculated the estimator's bias for a special case,
symmetric Bernoulli measures. In subsequent work, we expect to calculate the
bias in the general case and to develop an efficient algorithmic
implementation of the estimator.